R For Data Science Cheat Sheet 


Tidyverse for Beginners 
Learn More R for Data Science Interactively at www.datacamp.com 


The tidyverse is a collection of R packages for transforming and
visualizing data. All packages of the tidyverse share an underlying
philosophy and common APIs.


The core packages are: 


• ggplot2 implements the grammar of graphics. You can use it
  to visualize your data.
• dplyr is a grammar of data manipulation. You can use it to solve the
  most common data manipulation challenges.
• tidyr helps you create tidy data: data where each variable is in a
  column, each observation is a row, and each value is a cell.
• readr is a fast and friendly way to read rectangular data.
• purrr enhances R’s functional programming (FP) toolkit by providing a
  complete and consistent set of tools for working with functions and
  vectors.
• tibble is a modern reimagining of the data frame.
• stringr provides a cohesive set of functions designed to make
  working with strings as easy as possible.
• forcats provides a suite of useful tools that solve common problems
  with factors.





You can install the complete tidyverse with: 
> install.packages("tidyverse")


Then, load the core tidyverse and make it available in your current R
session by running:
> library(tidyverse)


Note: there are many other tidyverse packages with more specialised usage. They are not
loaded automatically with library(tidyverse), so you'll need to load each one with its own call
to library().


Useful Functions 


tidyverse_conflicts()    Conflicts between tidyverse and other packages
tidyverse_deps()         List all tidyverse dependencies
tidyverse_logo()         Get tidyverse logo, using ASCII or unicode characters
tidyverse_packages()     List all tidyverse packages
tidyverse_update()       Update tidyverse packages





Loading in the data 


> library(datasets)      Load the datasets package
> library(gapminder)     Load the gapminder package
> attach(iris)           Attach iris data to the R search path





dplyr


filter() allows you to select a subset of rows in a data frame.

> iris %>%                              Select iris data of species
    filter(Species=="virginica")        "virginica"
> iris %>%                              Select iris data of species
    filter(Species=="virginica",        "virginica" and sepal length
           Sepal.Length > 6)            greater than 6

arrange() sorts the observations in a dataset in ascending or descending order
based on one of its variables.

> iris %>%                              Sort in ascending order of
    arrange(Sepal.Length)               sepal length
> iris %>%                              Sort in descending order of
    arrange(desc(Sepal.Length))         sepal length
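As a minimal sketch of how filter() treats several conditions: arguments separated by commas are combined with a logical AND, so the two calls below should select the same rows. The names both_comma and both_and are illustrative only and do not appear elsewhere in this sheet.

```r
library(dplyr)

# Conditions separated by commas must all hold for a row to be kept
both_comma <- iris %>%
  filter(Species == "virginica", Sepal.Length > 6)

# The same selection written with an explicit logical AND
both_and <- iris %>%
  filter(Species == "virginica" & Sepal.Length > 6)

# Compare the two results
identical(both_comma, both_and)
```

The comma form tends to read more cleanly when each condition sits on its own line.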


Combine multiple dplyr verbs in a row with the pipe operator %>%:

> iris %>%                              Filter for species "virginica"
    filter(Species=="virginica") %>%    then arrange in descending
    arrange(desc(Sepal.Length))         order of sepal length

mutate() allows you to update or create new columns of a data frame.

> iris %>%                                   Change Sepal.Length to be
    mutate(Sepal.Length=Sepal.Length*10)     in millimeters
> iris %>%                                   Create a new column
    mutate(SLMm=Sepal.Length*10)             called SLMm
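One point the mutate() examples leave implicit is that the result is a new data frame and iris itself is not changed. The sketch below assumes you want to keep the new column, so it assigns the result to a name (iris_mm, chosen here for illustration).

```r
library(dplyr)

# mutate() returns a modified copy; assign it to keep the new column
iris_mm <- iris %>%
  mutate(SLMm = Sepal.Length * 10)

"SLMm" %in% names(iris_mm)  # the copy has the new column
"SLMm" %in% names(iris)     # the original iris is untouched
```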


Combine the verbs filter(), arrange(), and mutate():

> iris %>%
    filter(Species=="virginica") %>%
    mutate(SLMm=Sepal.Length*10) %>%
    arrange(desc(SLMm))
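To see what the pipe is doing in chains like these, it may help to compare a piped chain with the equivalent nested calls: x %>% f(y) passes x as the first argument, as in f(x, y). This is a small sketch; the names nested and piped are illustrative.

```r
library(dplyr)

# Nested form: read from the inside out
nested <- arrange(filter(iris, Species == "virginica"), desc(Sepal.Length))

# Piped form: read from top to bottom
piped <- iris %>%
  filter(Species == "virginica") %>%
  arrange(desc(Sepal.Length))

# Both forms should produce the same data frame
identical(nested, piped)
```

The piped form keeps the steps in the order they are applied, which is usually easier to follow once a chain has more than two verbs.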


Summarize 


summarize() allows you to turn many observations into a single data point.

> iris %>%                                        Summarize to find the
    summarize(medianSL=median(Sepal.Length))      median sepal length
> iris %>%                                        Filter for virginica then
    filter(Species=="virginica") %>%              summarize the median
    summarize(medianSL=median(Sepal.Length))      sepal length

You can also summarize multiple variables at once:

> iris %>%
    filter(Species=="virginica") %>%
    summarize(medianSL=median(Sepal.Length),
              maxSL=max(Sepal.Length))


group_by() allows you to summarize within groups instead of summarizing the
entire dataset:

> iris %>%                                        Find median and max
    group_by(Species) %>%                         sepal length of each
    summarize(medianSL=median(Sepal.Length),      species
              maxSL=max(Sepal.Length))
> iris %>%                                        Find median and max
    filter(Sepal.Length>6) %>%                    petal length of each
    group_by(Species) %>%                         species with sepal
    summarize(medianPL=median(Petal.Length),      length > 6
              maxPL=max(Petal.Length))
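For readers coming from base R, a rough way to think about group_by() followed by summarize() is "apply the summary function once per group". The sketch below compares the dplyr result with base R's tapply(); the names by_species and base_medians are illustrative.

```r
library(dplyr)

# One row per species, holding the median sepal length of that species
by_species <- iris %>%
  group_by(Species) %>%
  summarize(medianSL = median(Sepal.Length))

# Base R equivalent: apply median() to Sepal.Length within each species
base_medians <- tapply(iris$Sepal.Length, iris$Species, median)

by_species
base_medians
```

The dplyr version returns a data frame with one row per group, which makes it straightforward to pass on to further verbs or to ggplot2.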


























ggplot2


Scatter plot

Scatter plots allow you to compare two variables within your data. To do this with
ggplot2, you use geom_point().

> iris_small <- iris %>%
    filter(Sepal.Length > 5)
> ggplot(iris_small,                    Compare petal
         aes(x=Petal.Length,            width and length
             y=Petal.Width)) +
    geom_point()


Additional Aesthetics

• Color

> ggplot(iris_small,
         aes(x=Petal.Length,
             y=Petal.Width,
             color=Species)) +
    geom_point()

• Size

> ggplot(iris_small,
         aes(x=Petal.Length,
             y=Petal.Width,
             color=Species,
             size=Sepal.Length)) +
    geom_point()

Faceting

> ggplot(iris_small,
         aes(x=Petal.Length,
             y=Petal.Width)) +
    geom_point() +
    facet_wrap(~Species)
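The plots above are all built by adding components with +. As a small sketch of that layering, a plot can be stored in a variable and extended step by step; the names p, p_points, and p_facets are illustrative, and iris_small is recreated here so the snippet runs on its own.

```r
library(dplyr)
library(ggplot2)

iris_small <- iris %>%
  filter(Sepal.Length > 5)

# Start with the data and aesthetic mappings only (no layers yet)
p <- ggplot(iris_small, aes(x = Petal.Length, y = Petal.Width))

# Add a point layer, then split the plot into one panel per species
p_points <- p + geom_point()
p_facets <- p_points + facet_wrap(~Species)

# Printing a plot object draws it
p_facets
```

Building plots this way makes it easy to reuse the same base plot with different geoms or facets.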


Line Plots 





> by_year <- gapminder %>%
    group_by(year) %>%
    summarize(medianGdpPerCap=median(gdpPercap))
> ggplot(by_year, aes(x=year,
                      y=medianGdpPerCap)) +
    geom_line() +
    expand_limits(y=0)


Bar Plots 


> by_species <- iris %>%
    filter(Sepal.Length>6) %>%
    group_by(Species) %>%
    summarize(medianPL=median(Petal.Length))
> ggplot(by_species, aes(x=Species,
                         y=medianPL)) +
    geom_col()


Histograms 


> ggplot(iris_small,
         aes(x=Petal.Length)) +
    geom_histogram()


Box Plots 


> ggplot(iris_small,
         aes(x=Species,
             y=Sepal.Width)) +
    geom_boxplot()